@s0nskar s0nskar commented Oct 7, 2025

What changes were proposed in this pull request?

  • Fix the WorkerStatusTracker logic, so unknown workers are marked correctly in excluded workers.
  • Trigger shuffle data lost if the worker hosting the shuffle data is lost.

This can be extended to:

  • fast-fail mapper stages as well, before the commit starts.
  • handle the case where push replication is enabled and multiple workers are lost.

Why are the changes needed?

Currently, even if a worker crashes or becomes unavailable for some reason and is marked as lost by the Master, the reduce stage still tries to read data from it and fails only after running for some time, which is inefficient. We can detect this early and fail the reduce stage with SHUFFLE_DATA_LOST before the stage starts.
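
As a rough illustration of the idea (a minimal sketch, not the actual Celeborn code): the class `FastFailSketch`, the methods `markUnknown` and `checkBeforeReduce`, the `Status` enum, and the worker address strings are all hypothetical names chosen for this example. It assumes the client keeps a set of excluded workers and knows which worker hosts each partition of the shuffle.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the early check; not the real WorkerStatusTracker. */
final class FastFailSketch {

    /** Illustrative status values only; the real client defines its own codes. */
    enum Status { SUCCESS, SHUFFLE_DATA_LOST }

    /** Workers the client currently treats as unusable (lost, unknown, ...). */
    private final Set<String> excludedWorkers = new HashSet<>();

    /** Record workers that the Master reports as unknown, so they count as excluded. */
    void markUnknown(List<String> unknownWorkers) {
        excludedWorkers.addAll(unknownWorkers);
    }

    /**
     * Called before the reduce stage starts: if any worker hosting data for
     * this shuffle is excluded, report the data as lost right away instead
     * of letting reducers fail later while fetching.
     */
    Status checkBeforeReduce(Map<Integer, String> partitionToWorker) {
        for (String worker : partitionToWorker.values()) {
            if (excludedWorkers.contains(worker)) {
                return Status.SHUFFLE_DATA_LOST;
            }
        }
        return Status.SUCCESS;
    }
}
```

With no excluded workers the check passes; once a worker that hosts one of the partitions is marked unknown, the same call returns `SHUFFLE_DATA_LOST` without any reducer attempting a fetch.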

Does this PR introduce any user-facing change?

NA

How was this patch tested?

WIP

@s0nskar s0nskar marked this pull request as ready for review October 8, 2025 12:02
@cxzl25 cxzl25 left a comment

LGTM

@SteNicholas SteNicholas requested a review from RexXiong October 20, 2025 02:07
@RexXiong RexXiong left a comment

Thanks, LGTM

@SteNicholas SteNicholas left a comment

LGTM.

@SteNicholas SteNicholas changed the title from "[CELEBORN-2166] Fastfail reduce stage if shuffle data is lost because of worker lost" to "[CELEBORN-2166] Fast fail reduce stage if shuffle data is lost because of worker lost" on Oct 29, 2025
@SteNicholas

Merged to main (v0.7.0).

@turboFei

We can merge this PR into branch-0.6 as well; I will update the config version.

turboFei pushed a commit that referenced this pull request Dec 30, 2025
…e of worker lost

- Fix the WorkerStatusTracker logic, so unknown workers are marked correctly in excluded workers.
- Trigger shuffle data lost if the worker hosting the shuffle data is lost.

This can be extended to:
- fast-fail mapper stages as well, before the commit starts.
- handle the case where push replication is enabled and multiple workers are lost.

Currently, even if a worker crashes or becomes unavailable for some reason and is marked as lost by the Master, the reduce stage still tries to read data from it and fails only after running for some time, which is inefficient. We can detect this early and fail the reduce stage with SHUFFLE_DATA_LOST before the stage starts.

NA

WIP

Closes #3496 from s0nskar/CELEBORN-2166.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
(cherry picked from commit 1157d6a)
Signed-off-by: Wang, Fei <fwang12@ebay.com>
@turboFei

Thanks, merged to 0.6.3 as well.

turboFei added a commit that referenced this pull request Dec 30, 2025
…stOnUnknownWorker.enabled version to 0.6.3

### What changes were proposed in this pull request?

Update config celeborn.client.shuffleDataLostOnUnknownWorker.enabled version to 0.6.3

### Why are the changes needed?

Follow-up for #3496; it is better to merge it into branch-0.6 as well.

### Does this PR resolve a correctness bug?

No.

### Does this PR introduce _any_ user-facing change?

No, it has not been released yet.

### How was this patch tested?

GA.

Closes #3576 from turboFei/update_conf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
turboFei added a commit that referenced this pull request Dec 30, 2025
…stOnUnknownWorker.enabled version to 0.6.3

(cherry picked from commit 38532d7)
Signed-off-by: Wang, Fei <fwang12@ebay.com>